1.  Introduction


In this paper we consider the problem of finding the n-bit result of dividing one n-bit number by another. We present circuits with asymptotically small size and depth for this problem, and we derive from them efficient PRAM algorithms for division in the bit model. Our primary result, which is a size-efficient implementation of the circuits in Reif [Re86], is a logspace uniform family of circuits for division of depth $D(n) = O(\log n \log\log n)$ and size $S(n) = O((1/\delta^4)\, n^{1+\delta})$, for any $\delta > 0$. This translates into a uniform parallel algorithm on a shared memory machine (PRAM) with bit operations and exclusive memory writes, with parallel time $D(n)$ using $O(S(n))$ processors. It also translates into a parallel algorithm for a concurrent write PRAM with parallel time $O(D(n)/\log\log n)$ using $O(S(n))$ processors. Finally, we apply the results of Beame, Cook and Hoover [BeCoHo86] to obtain a polynomial-time uniform family of circuits for division of depth $O((1/\delta^2) \log n)$ and size $O(S(n))$.

2.  Parallel  Models  of  Computation 

A (bounded fan-in) boolean circuit is an acyclic labeled digraph. Nodes are labeled as input, constant, AND, OR, NOT, or output nodes. Input and constant nodes have zero fan-in, AND and OR nodes have fan-in of 2, and NOT and output nodes have fan-in of 1. Output nodes have fan-out zero.

Let $\Sigma = \{0,1\}$. A boolean circuit with n input and m output nodes computes a boolean function $f: \Sigma^n \to \Sigma^m$. The size of a boolean circuit is the number of nodes in the circuit excluding input and output nodes. The depth of the circuit is the length of the longest path among all paths from input to output nodes. Given a sequence of circuits $C_1, C_2, \ldots$ we denote the size of the n-th circuit by $SIZE(C_n)$ and its depth by $DEPTH(C_n)$. If there exists a function $S(n)$ such that $SIZE(C_n) \le S(n)$ for each n then we say that the size of the sequence is $O(S(n))$. Similarly we may define the depth $O(D(n))$ of the sequence. We say a sequence is in SIZE-DEPTH$(S(n), D(n))$ if it is simultaneously bounded in size by $O(S(n))$ and in depth by $O(D(n))$. A sequence of boolean functions will be referred to as a problem, and a sequence of circuits such that the n-th circuit realizes the n-th boolean function is an algorithm to solve the problem. We will say that an algorithm gives circuits of small size for a problem if $S(n) = O(f(\delta) \cdot n^{1+\delta})$, for any $\delta > 0$ and for some function f. Small size circuits are desirable since they lead to low hardware costs. For parallel computation, we also want $D(n)$ to be small, since $D(n)$ gives the parallel computation time.


A sequence of circuits is logspace uniform for a problem if there exists a logspace-bounded Turing machine that computes a suitable binary encoding of the n-th circuit on being input the number n in unary. For theoretical reasons logspace uniformity is a desirable property in a sequence of circuits (see, e.g., [Ru81]).

A  PRAM  in  the  bit  model  is  a  parallel  RAM  with  access  to  a  global  memory  and  with  each  processor 
capable  of  a  bit  operation  in  unit  time.  A  CREW  PRAM  is  a  PRAM  allowing  concurrent  reads  but  only 
exclusive  writes  on  the  global  memory.  A  CRCW  PRAM  allows  concurrent  reads  and  writes.  A  PRAM 
algorithm  is  uniform  if  the  algorithm  is  parametrized  by  n  and  works  in  the  parallel  time  bound  for  all 
values  of  n. 

3.  Previous  Work  on  Circuits  for  Basic  Arithmetic  Operations 

For the problem of adding two n-bit numbers, Krapchenko [Kr70] and Ladner and Fischer [LaFi80] present algorithms which achieve an asymptotically optimal delay of $\log n$ and a linear size bound, which are the best possible in this model (see [Sa76]).

For adding n n-bit numbers, Ofman [Of62] and Wallace [Wa64] have $O(\log n)$ depth circuits with $O(n^2)$ gates, which is linear in the size of the input.

The best known circuits for the multiplication of two n-bit numbers are due to Schonhage and Strassen [ScSt71]. These have $O(\log n)$ depth and $O(n \log n \log\log n)$ size.

In the division problem, we need to compute the n-bit quotient $u/v$ where u and v are n-bit numbers. Since $u/v = u \cdot (1/v)$, and multiplication can be done efficiently by the Schonhage-Strassen algorithm, attention has been concentrated on the computation of the n-bit reciprocal of an n-bit number. The first good circuits for the reciprocal problem are due to Cook [Co66]. The method used in [Co66] to compute the reciprocal is to first normalize v to a number in the interval [1/2,1), set $x = 1-v$, and compute $1/(1-x) = 1+x+x^2+x^3+\cdots$, where the first n terms of the series give sufficient precision. The problem is thus reduced to that of computing the powers of x, where x is an n-bit number. [Co66] presents an $O(\log^2 n)$ depth polynomial size family of circuits for this problem. Recent attempts to obtain better circuits for division have all concentrated on the powering problem.

A log depth polynomial size circuit for division has been obtained in [BeCoHo86], which solves the powering problem by using a combination of Chinese remaindering and taking logarithms over finite fields. The algorithm does not appear to be logspace uniform, though the circuits are polynomial time computable. Reif [Re86] has logspace uniform $O(\log n \log\log n)$ depth division circuits of polynomial size, parametrized only by n, so that the algorithm translates into a uniform CREW PRAM algorithm. The division circuits in both [Re86] and [BeCoHo86] are worse than quadratic in size. Thus while there are known small size circuits for addition and multiplication of $O(\log n)$ depth, this is not the case for division.

In the next section we present circuits of size $O((1/\delta^4) \cdot n^{1+\delta})$, for any $\delta > 0$, that achieve the same depths as in [Re86] and [BeCoHo86], thus obtaining small size circuits of small depth for the division problem.

We close this section with a brief discussion of the DFT, presenting the definitions and theorems that we need (for further details and proofs see [AhHoUl74]).

Let R be a commutative ring with identity 1. Then the set of all infinite sequences from R with only finitely many non-zero terms forms a commutative ring with identity 1 under componentwise addition, and multiplication defined by convolution. This ring is called the ring of formal polynomials over R and is denoted by $R[t]$, and the sequence whose (k+1)-th term is non-zero and whose later terms are all zero is denoted simply by $(a_0, a_1, \ldots, a_k)$ and also often written as $a_0 + a_1 t + a_2 t^2 + \cdots + a_k t^k$.

Let R have a primitive n-th root of 1 (denoted by $\omega$), where n is a unit in R, i.e. n has an inverse in R. Let the $n \times n$ matrix $M = (m_{ij})$ be defined by $m_{ij} = \omega^{ij}$, $i,j = 0, 1, \ldots, n-1$. The matrix M is invertible. If A is an n-vector, define $DFT_n(A) = MA$ and $DFT_n^{-1}(A) = M^{-1}A$. Note that $DFT_n^{-1}(DFT_n(A)) = A$. [CoTu65] introduced an algorithm which translates into a size $O(n^2 \log n)$, depth $O(\log n)$ circuit for computing the DFT of n n-bit numbers. If we further assume that there exists $\psi$ in R such that $\psi^n = -1$ and $\psi^2 = \omega$, then with any (n-1)-degree polynomial $A(t) = a_0 + a_1 t + a_2 t^2 + \cdots + a_{n-1} t^{n-1}$ we can associate $A^* = (a_0, a_1\psi, a_2\psi^2, \ldots, a_{n-1}\psi^{n-1})$. We now state

The negatively wrapped convolution theorem: Let $A_i(t)$, $i = 1, 2, \ldots, r$ be (n-1)-degree polynomials, and $B(t) = \prod_{i=1}^{r} A_i(t)$, $\prod$ being the usual product in $R[t]$. Let $D(t) = B(t) \bmod t^n + 1$. Then

$$D^* = DFT_n^{-1}\Big(\prod_{i=1}^{r} DFT_n(A_i^*)\Big)$$

where the multiplication in the transformed domain is componentwise.
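
As an illustration only, the theorem can be checked numerically in the ring $Z_{2^n+1}$ with $\psi = 2$ and $\omega = 4$, which is the ring used later in the paper. The following Python sketch is not a circuit construction: the helper names `dft`, `negacyclic_product` and `schoolbook` are ours, the DFT is the naive matrix product rather than the [CoTu65] algorithm, n is assumed to be a power of 2, and modular inverses via `pow(a, -1, m)` assume Python 3.8 or later.

```python
def dft(vec, omega, mod):
    """Naive DFT: entry i is sum_j vec[j] * omega^(i*j), reduced mod `mod`."""
    n = len(vec)
    return [sum(vec[j] * pow(omega, i * j, mod) for j in range(n)) % mod
            for i in range(n)]


def negacyclic_product(polys, n):
    """Product of the given polynomials mod t^n + 1 over Z_{2^n+1},
    computed as D* = DFT^{-1}(prod_i DFT(A_i*)) and then unweighted."""
    mod = (1 << n) + 1          # the ring Z_{2^n+1}
    psi, omega = 2, 4           # psi^n = -1 and psi^2 = omega in this ring
    acc = [1] * n               # componentwise product in the transformed domain
    for a in polys:
        starred = [a[j] * pow(psi, j, mod) % mod for j in range(n)]   # A*
        acc = [x * y % mod for x, y in zip(acc, dft(starred, omega, mod))]
    inv_n = pow(n, -1, mod)
    d_star = [inv_n * v % mod for v in dft(acc, pow(omega, -1, mod), mod)]
    # Undo the weighting by powers of psi to recover D from D*.
    return [d_star[j] * pow(psi, -j, mod) % mod for j in range(n)]


def schoolbook(polys, n):
    """Reference: multiply directly, wrapping t^n to -1, coefficients mod 2^n+1."""
    mod = (1 << n) + 1
    result = [1] + [0] * (n - 1)
    for a in polys:
        nxt = [0] * n
        for i, ri in enumerate(result):
            for j, aj in enumerate(a):
                if i + j < n:
                    nxt[i + j] += ri * aj
                else:
                    nxt[i + j - n] -= ri * aj   # t^n = -1
        result = [v % mod for v in nxt]
    return result
```

With n = 8 the modulus is 257, and `negacyclic_product` agrees with `schoolbook` on any list of length-8 coefficient vectors.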

4.  The  Reciprocal  Problem 

Let x be an n-bit number in [1/2,1). We wish to compute $v = 1/x$ correct to n places to the right of the point. Let $u = 1-x$, where $0 < u \le 1/2$ and u has n bits. Then $1/x = 1/(1-u) = 1+u+u^2+u^3+\cdots$. We would obtain $1/x$ to sufficient precision if we compute $1, u, u^2, u^3, \ldots$ exactly (each of these numbers can be represented exactly using at most $n^2$ bits since u has only n bits), and add them up and truncate the sum to n bits to the right of the point. Since in this method the powers are computed to $n^2$ bits of precision, the resulting circuit has size $\Omega(n^2)$. However, the computation of the powers to $n^2$ bits is unnecessary, as is shown in [MePr86]. Consider the factorization:

$$1+u+u^2+\cdots+u^{s^m-1} = (1+u+u^2+\cdots+u^{s-1})(1+u^s+u^{2s}+\cdots+u^{(s-1)s})\cdots(1+u^{s^{m-1}}+\cdots+u^{(s-1)s^{m-1}})$$

where $s = n^{1/m}$, m being a fixed integer. Denote the i-th factor $(1+u^{s^i}+u^{2s^i}+\cdots+u^{(s-1)s^i})$ by $\phi_i'$, $i = 0, 1, \ldots, (m-1)$. We can compute each factor $\phi_i'$, and then multiply the m factors and truncate the result to n+2 bits to get $1/x$. Note that $\phi_i'$ is actually the sum of the powers $(u^{s^i})^j$ for $j = 0, 1, 2, \ldots, (s-1)$. It is proved in [MePr86] that if we use an $n+\log(12m)$ bit approximation to $u^{s^i}$ (which we denote by $\theta_i$), compute $\theta_i^j$, $j = 0, \ldots, s-1$ exactly, i.e. to ns bits, and add these s numbers to get an ns-bit approximation $\phi_i$ to $\phi_i'$, then the product of the m $\phi_i$ truncated to n bits gives an n-bit approximation to $1/x$. The $\theta_i$ are obtained as follows: $\theta_0$ is initialized to u, and $\theta_i$ is obtained by computing $\theta_{i-1}^s$ and truncating to $n+\log(12m)$ bits. Note that no computation involves more than ns bits. Though the above factorization is not valid if $n^{1/m}$ is not an integer, this algorithm for computing the reciprocal is still good with $s = \lceil n^{1/m} \rceil$, since this only means that we compute even more than n terms in the series.
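
The scheme just described can be tried out with exact rational arithmetic. The Python sketch below is an illustration under our own assumptions, not the circuit of [MePr86]: the function names are ours, x is passed as an integer numerator over $2^n$, the approximations $\theta_i$ are kept to $n + \lceil \log_2(12m) \rceil$ fractional bits, and the powers $\theta_i^j$ are formed directly rather than by the modular power algorithm given below.

```python
from fractions import Fraction
from math import ceil, log2


def truncate(q, bits):
    """Drop everything beyond `bits` binary places of a non-negative Fraction."""
    scale = 1 << bits
    return Fraction((q.numerator * scale) // q.denominator, scale)


def reciprocal(x_num, n, m):
    """Approximate 1/x for x = x_num / 2^n in [1/2, 1) as the product of
    m factors phi_i = sum_{j < s} theta_i^j, where theta_i approximates u^(s^i)."""
    x = Fraction(x_num, 1 << n)
    assert Fraction(1, 2) <= x < 1
    u = 1 - x
    s = 1
    while s ** m < n:                 # smallest integer s with s^m >= n
        s += 1
    precision = n + ceil(log2(12 * m))
    theta = u                         # theta_0 = u
    product = Fraction(1)
    for _ in range(m):
        phi = sum(theta ** j for j in range(s))       # sum of s exact powers
        product *= phi
        theta = truncate(theta ** s, precision)       # next theta_i
    return truncate(product, n)
```

For example, `reciprocal(40000, 16, 2)` uses s = 4 and returns a fraction that agrees with 65536/40000 to within a few units in the 16th binary place.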

Let $S_1(s,n)$ and $T_1(s,n)$ denote the size and time of computing each $\phi_i$. If $S_0(n)$ is the size of reciprocation, then $S_0(n) = O(m S_1(s,n) + m M(ns))$, where the first term on the RHS is the cost of computing the m $\phi_i$, and the second term is the cost of multiplying them together. If we denote the depth for computing the reciprocal by $T_0(n)$ then $T_0(n) = O(m T_1(s,n) + \log m \log ns)$, where the first term is the depth for computing the m $\phi_i$ and the second term the depth for multiplying them together. Since $\phi_i$ is the sum of $\theta_i^j$, $j = 0, 1, \ldots, s-1$, with $\theta_i$ being of $O(n)$ bits, if we denote the size of computing the s-th power of an n-bit number by $S_2(s,n)$, we find that $S_1(s,n) = s S_2(s,n) + n s^2$, where the first term is the cost of computing the s powers of $\theta_i$ and the second term is the cost of adding them up. Similarly $T_1(s,n) = T_2(s,n) + \log ns$, where the first term is the depth of computing the s powers of $\theta_i$ in parallel, and the second term the depth of adding them up. From the above it follows that

$$S_0(n) = s m S_2(s,n) + n m s^2 + m M(ns) \qquad (1)$$

and

$$T_0(n) = m T_2(s,n) + m \log ns \qquad (2)$$

We now develop an efficient algorithm for computing the s-th power of an n-bit number and determine its size $S_2(s,n)$ and time $T_2(s,n)$. We actually describe an algorithm that computes the s-th power of an r-bit number modulo $2^r+1$. This will be used to compute the s-th power of an n-bit number exactly, by treating the input as an ns-bit number and computing its s-th power mod $2^{ns}+1$. Our algorithm is based on the modular product algorithm in [Re86] and uses the DFT. We first introduce some notation. The ring $Z_{2^k+1}$ has k as a unit, $\omega = 4$ as a primitive k-th root of unity, and $\psi_k = 2$ satisfies $\psi^k = -1$ and $\psi^2 = \omega$. Hence the DFT and inverse DFT of k-sequences can be defined in this ring, and by the notation $DFT_k(x_0, x_1, \ldots, x_{k-1}) \bmod 2^k+1$ we mean the DFT (as previously defined) in the ring $Z_{2^k+1}$ with $\omega = 4$. Denote by $(x_0, \ldots, x_{k-1}) \bmod 2^k+1$ the vector $(x_0, \ldots, x_{k-1})$ with each of its components reduced modulo $2^k+1$. We now state the algorithm and follow it up with a discussion.

The  modular  power  algorithm 

Input $x = \xi_{r-1}\xi_{r-2} \cdots \xi_0$ is of r bits, the least significant bit being $\xi_0$ and the most significant bit being $\xi_{r-1}$. r is a power of 2. Output is $x^s \bmod 2^r+1$, where $s = r^\epsilon$, $0 < \epsilon \le 1/2$.

Function Modpower (x, r, s)
begin
    if r ≤ 4
        Modpower ← x^s mod 2^r+1;   (* compute directly by constant depth, constant size circuit
                                       since r ≤ 4 and s ≤ 2; this is the base step of recursion *)
    else
    begin
        case r ≥ s^2 do
        begin
            l ← (1/2)·√(r/s);
            k ← 2·√(r·s);
            divide x into k equal blocks of l bits each and form the vector (x_0, x_1, ..., x_{k-1})
                where x_j = ξ_{(j+1)l-1} ... ξ_{jl};
                (* this is the vector g(t) in the discussion below *)
            (x_0, x_1, x_2, ..., x_{k-1}) ← (x_0, x_1·2, x_2·2^2, ..., x_{k-1}·2^{k-1}) mod 2^k+1;
                (* this is g* *)
            (x_0, ..., x_{k-1}) ← DFT_k(x_0, ..., x_{k-1}) mod 2^k+1;
                (* this is DFT_k(g*) *)
            par do x_i ← Modpower(x_i, k, s);
                (* this is the componentwise powering of k smaller numbers in the
                   transformed domain, done recursively in parallel *)
            (x_0, ..., x_{k-1}) ← DFT_k^{-1}(x_0, ..., x_{k-1}) mod 2^k+1;
                (* this is d* *)
            (x_0, x_1, x_2, ..., x_{k-1}) ← (x_0, x_1·2^{-1}, x_2·2^{-2}, ..., x_{k-1}·2^{-(k-1)}) mod 2^k+1;
                (* this is d(t) *)
            Modpower ← (x_0 + x_1(2^l) + x_2(2^l)^2 + ... + x_{k-1}(2^l)^{k-1})
                (* this is d(2^l), which is what we want *)
        end

        case r < s^2 do
        begin
            Modpower ← x^s mod 2^r+1   (* compute by the modular product algorithm in [Re86] *)
        end
    end
end.


Remarks  on  the  algorithm 

The main idea in the algorithm is to split x into k blocks, $k = 2\sqrt{rs}$, and construct the polynomial $g(t) = a_0 + a_1 t + a_2 t^2 + \cdots + a_{k-1} t^{k-1}$ whose coefficients are the blocks. If $l = r/k$, then $g(2^l) = x$. If we let $d(t) = (g(t))^s \bmod t^k+1$, then $d(2^l) = (g(2^l))^s \bmod (2^l)^k+1 = x^s \bmod 2^r+1$. Finding $d(t)$ would solve the problem, since its value at $2^l$ is the desired $x^s \bmod 2^r+1$. The polynomials above are over the ring of integers Z but we do calculations over the finite ring $Z_{2^k+1}$. Apply the convolution theorem to get $d^* = DFT_k^{-1}((DFT_k(g^*))^s) \bmod 2^k+1$, where the powering is componentwise in the transformed domain. Notice that there are k powerings in the transformed domain, each of k-bit numbers mod $2^k+1$, where k is smaller than r, and we do these powerings recursively. We need to be sure that the d computed as above (in $Z_{2^k+1}$) gives the correct d (in Z). This will be so if the coefficients of d are small enough, i.e., $\le 2^{k-1}$. This can be arranged by requiring $s = r^\epsilon$, $0 < \epsilon \le 1/2$, and by choosing $k = 2\sqrt{rs}$ as we have done. (It is shown in [Re86] that it is sufficient to choose s and k so that $s(r/k + 1 + \log k) \le k-1$ is satisfied. The above choices of s and k satisfy this inequality for sufficiently large r.) Essentially what we are doing is making each coefficient of g small enough by splitting x into sufficiently many pieces. Since the ring $Z_{2^k+1}$ grows with k (the number of coefficients in g), if we have a sufficiently large number of coefficients and we make the power s small enough, we can expect $g^s$ to have small enough coefficients, so that it can be represented without error in $Z_{2^k+1}$.
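
To make the remarks concrete, the following Python sketch carries out a single level of the splitting, with r = 64, s = 4, hence k = 32 and l = 2 in the usage note. It is a sequential illustration under our own assumptions rather than the circuit itself: the function names are ours, the DFT is the naive matrix product, the k componentwise powerings use Python's built-in `pow` instead of recursion or the [Re86] algorithm, residues are lifted to the symmetric range so that negative coefficients of d are recovered, and the result is reduced mod $2^r+1$ explicitly at the end. It assumes Python 3.8 or later for `math.isqrt` and `pow(a, -1, m)`.

```python
from math import isqrt


def dft(vec, omega, mod):
    """Naive DFT: entry i is sum_j vec[j] * omega^(i*j), reduced mod `mod`."""
    n = len(vec)
    return [sum(vec[j] * pow(omega, i * j, mod) for j in range(n)) % mod
            for i in range(n)]


def modpower_one_level(x, r, s):
    """x^s mod 2^r+1 for an r-bit x, via one negatively wrapped convolution."""
    k = 2 * isqrt(r * s)
    assert k * k == 4 * r * s and r % k == 0, "need k = 2*sqrt(r*s) dividing r"
    l = r // k
    # Coefficients of d must fit in Z_{2^k+1} (sufficient condition).
    assert s * l + (s - 1) * (k - 1).bit_length() < k - 1
    mod = (1 << k) + 1
    blocks = [(x >> (j * l)) & ((1 << l) - 1) for j in range(k)]      # g
    g_star = [b * pow(2, j, mod) % mod for j, b in enumerate(blocks)]  # g*
    powered = [pow(v, s, mod) for v in dft(g_star, 4, mod)]            # (DFT g*)^s
    inv_k = pow(k, -1, mod)
    d_star = [inv_k * v % mod for v in dft(powered, pow(4, -1, mod), mod)]
    d = [v * pow(2, -j, mod) % mod for j, v in enumerate(d_star)]
    d = [v - mod if v > mod // 2 else v for v in d]   # symmetric lift to Z
    value = sum(c * (1 << (j * l)) for j, c in enumerate(d))           # d(2^l)
    return value % ((1 << r) + 1)
```

For instance, `modpower_one_level(0x123456789ABCDEF0, 64, 4)` equals `pow(0x123456789ABCDEF0, 4, 2**64 + 1)`.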


Choice (1) of the case statement of the algorithm is entered recursively $\log(1/\epsilon)$ times on first calling the program. We start with $k = r$ and in each subsequent application $k \leftarrow 2\sqrt{ks}$, and we keep going until $k < s^2$. We set up the recurrence $k_0 = r$, $k_i = 2\sqrt{k_{i-1} s}$ and solve to get $k_l = (4s)^{1-(1/2)^l}\, r^{(1/2)^l}$. In about $\log(1/\epsilon)$ steps $k_l < s^2$.
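
For completeness, one way to solve the recurrence is to take logarithms; the auxiliary quantity $a_i = \log_2 k_i$ is introduced here only for this derivation.

```latex
\begin{align*}
k_i = 2\sqrt{k_{i-1}\,s}
  &\;\Longrightarrow\; a_i = 1 + \tfrac{1}{2}\,(a_{i-1} + \log_2 s),
  \qquad a_i := \log_2 k_i, \\
\text{fixed point: } a^\ast = 1 + \tfrac{1}{2}(a^\ast + \log_2 s)
  &\;\Longrightarrow\; a^\ast = 2 + \log_2 s = \log_2(4s), \\
a_i - a^\ast = \tfrac{1}{2}\,(a_{i-1} - a^\ast)
  &\;\Longrightarrow\; a_l = a^\ast + (1/2)^l\,(a_0 - a^\ast), \\
k_l = 2^{a_l}
  &= (4s)^{1-(1/2)^l}\; k_0^{(1/2)^l}.
\end{align*}
```

With $k_0 = r$ this is the closed form stated above.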

At this point we need the s-th power of an $r^{2\epsilon}$-bit number. Now choice (2) of the case statement is executed. Since $s = r^\epsilon$ is small compared with r, we do not attempt to be efficient with this residual computation, but simply apply the algorithm in [Re86]. The size complexity of the algorithm is dominated by choice (1) of case, as our analysis will show.

5.  Gate  Count 

As already mentioned, the exact value of $x^s$ may be found by treating x as an ns-bit number and computing $x^s \bmod 2^{ns}+1$. This is accomplished by calling Modpower(x, ns, s). We now compute the size and depth for this problem. We solved a recurrence for $k_i$ above with initial condition $k_0 = r$. With initial condition $k_0 = ns$ (which is the case in the exact powering problem), $k_l = 4^{1-(1/2)^l}\, n^{(1/2)^l}\, s$. Recall that we denoted by $S_2(s,n)$ the size needed for computing the s-th power of an n-bit number. If $S(s,n)$ denotes the size of computing the s-th power of an n-bit number mod $2^n+1$ then $S_2(s,n) = S(s,ns)$ as seen above.

From choice (1) of case in the algorithm we obtain

$$S(s,ns) = S(s,k_0) = c\,k_1^2 \log k_1 + k_1 S(s,k_1)$$

The first term on the RHS is the cost of taking the DFT of a $k_1$-vector in $Z_{2^{k_1}+1}$ by the Cooley-Tukey algorithm [CoTu65]. The second term is for the recursive computation of $k_1$ smaller powerings. (Computing $g^*$ from g and d from $d^*$ in the algorithm needs $O(k_1)$ per entry and hence $O(k_1^2)$ overall, which follows from Lemma 7.6, p. 266 of [AhHoUl74]. Computing $d(2^l)$ is like adding two n-bit numbers, and needs only size $O(k_1^2)$. These steps in the algorithm are all dominated by the cost of computing the DFT.) Now replace $S(s,k_1)$ by an expression in terms of $k_2$ and keep doing this for l steps to get

$$S(s,k_0) = c \sum_{i=1}^{l} \Big(\prod_{j=1}^{i} k_j\Big) k_i \log k_i + \Big(\prod_{j=1}^{l} k_j\Big) S(s,k_l)$$


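Using the formula for $k_i$ we get

$$\prod_{j=1}^{i} k_j = 4^{\,i-1+(1/2)^i}\, n^{1-(1/2)^i}\, s^i, \qquad\text{so that}\qquad \Big(\prod_{j=1}^{i} k_j\Big) k_i = 4^i\, n\, s^{i+1},$$

and $\log k_i = O(\log n)$ since $s = n^\epsilon$ with $\epsilon \le 1/2$. This gives

$$S(s,k_0) \le c' \log n \sum_{i=1}^{l} \Big(\prod_{j=1}^{i} k_j\Big) k_i + \Big(\prod_{i=1}^{l} k_i\Big) S(s,k_l)$$

Using our previous formula for $\prod_{j=1}^{l} k_j$ we get

$$S(s,k_0) \le c''\, n\, 4^l s^{l+1} \log n + 4^{\,l-1+(1/2)^l}\, n^{1-(1/2)^l}\, s^l\, S(s,k_l)$$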

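With $l = \log(1/\epsilon)$ and $s = n^\epsilon$ this becomes

$$S(n^\epsilon, n^{1+\epsilon}) \le c''(1/\epsilon^2)\, n^{1+\epsilon+\epsilon\log(1/\epsilon)} \log n + (1/\epsilon^2)\, n^{1-\epsilon+\epsilon\log(1/\epsilon)}\, S(n^\epsilon, n^{2\epsilon})$$

$S(n^\epsilon, n^{2\epsilon})$ is the size complexity of choice (2) of case, which is $O(n^{4\epsilon})$. This gives us

$$S_2(n^\epsilon, n) = S(n^\epsilon, n^{1+\epsilon}) \le c'''(1/\epsilon^2)\, n^{1+3\epsilon+\epsilon\log(1/\epsilon)}$$

Using this in equation (1) with $m = 1/\epsilon$ we find that the size for computing the reciprocal is

$$S_0(n) \le c''''(1/\epsilon^3)\, n^{1+4\epsilon+\epsilon\log(1/\epsilon)}$$

(We have omitted the second term on the RHS of eqn (1) since it is dominated by the first.) From this it follows that for any $\delta > 0$, there are circuits for computing the reciprocal that have size $O((1/\delta^4)\, n^{1+\delta})$. Suppose $\delta > 0$ is given. We solve for $\epsilon$ using the equation $\delta = 4\epsilon + \epsilon\log(1/\epsilon)$, and construct circuits as above with this $\epsilon$. Clearly $\epsilon < \delta$, and since for small $\epsilon$, $\delta < \epsilon^{3/4}$, the above result is true.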
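
The relation between $\delta$ and $\epsilon$ can be explored numerically. The short Python sketch below is only an illustration: the function names are ours, logarithms are taken to base 2 (an assumption consistent with $4^{\log(1/\epsilon)} = 1/\epsilon^2$ used above), and $\epsilon$ is found by plain bisection on (0, 1), where $4\epsilon + \epsilon\log(1/\epsilon)$ is increasing.

```python
from math import log2


def delta_of(eps):
    """delta = 4*eps + eps*log2(1/eps), for 0 < eps < 1."""
    return 4 * eps + eps * log2(1 / eps)


def solve_eps(delta, iterations=200):
    """Bisection for the eps in (0, 1) with delta_of(eps) == delta (0 < delta < 4)."""
    lo, hi = 0.0, 1.0
    for _ in range(iterations):
        mid = (lo + hi) / 2
        if delta_of(mid) < delta:
            lo = mid
        else:
            hi = mid
    return hi
```

For example, `solve_eps(0.1)` is a little below 0.01, so $\epsilon < \delta$ as claimed. The inequality $\delta < \epsilon^{3/4}$ needs $\epsilon$ to be quite small: under the base-2 assumption it holds at $\epsilon = 2^{-20}$ but not yet at $\epsilon = 2^{-16}$.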

Recall that we denoted by $T_2(n^\epsilon, n)$ the depth for computing the $n^\epsilon$-th power of an n-bit number. Note that choice (1) of case is entered $\log(1/\epsilon)$ times, and each application is dominated in depth by the DFT computation. The total contribution from all this to the depth is $O(\log(1/\epsilon)\log n)$. As for choice (2), the [Re86] algorithm needs $O(\log r \log\log r)$ depth to compute the r-th power of an r-bit number mod $2^r+1$. Using this with $r = n^\epsilon$ we see that choice (2) of case contributes $O(\epsilon \log n \log\log n)$ to the depth. Hence we have

$$T_2(n^\epsilon, n) = \log(1/\epsilon)\log n + \epsilon \log n \log\log n$$

Using this in equation (2) with $m = 1/\epsilon$ we get the depth complexity of division

$$T_0(n) = (1/\epsilon)\log(1/\epsilon)\log n + \log n \log\log n = O(\log n \log\log n)$$

We observe that the additional factor $\log\log n$ for the depth in the above algorithm arises from the final application of the [Re86] algorithm directly in choice (2) of case. We can avoid this at the cost of losing logspace uniformity by using one application of the [BeCoHo86] algorithm to complete the computation once we get to the point where we need to compute the $n^\epsilon$-th power of an $n^{2\epsilon}$-bit number. Until this point we have used depth of $\log(1/\epsilon)\log n$, and with the additional depth of $\epsilon\log n$ of the [BeCoHo86] algorithm, the $n^\epsilon$-th power of an n-bit number can be computed in depth $O(\log(1/\epsilon)\log n)$. Substituting this in place of $T_2(n^\epsilon, n)$ in equation (2) we find that the depth of the division algorithm incorporating the [BeCoHo86] circuit is $O((1/\epsilon)\log(1/\epsilon)\log n) = O((1/\delta^2)\log n)$. The size of this application of the [BeCoHo86] circuit is $O(n^{4\epsilon})$, so the size result previously obtained for the algorithm based purely on [Re86] is not affected.

For PRAM implementation, we note that Modpower is parametrized only by r and is logspace uniform, so our division circuit translates to an $O(\log n \log\log n)$ time algorithm on the CREW PRAM (with bit operations) using $O((1/\delta^4)\, n^{1+\delta})$ processors, for any $\delta > 0$.

Using standard techniques (see [ChStVi84]), our logspace uniform $O(\log n \log\log n)$ depth division circuit can be compressed in depth to $O((1/k)\log n)$ for any $k > 0$, with an increase in size of a factor of $2^{\log^k n}$. We let $k = 1/2$ and translate the resulting unbounded fan-in circuits to a CRCW PRAM algorithm in the bit model running in time $O(\log n)$ and using $O((1/\delta^4)\, n^{1+\delta})$ processors, for any $\delta > 0$.